BIO260 Final Project


Background and Motivation

As one of our most-used web tools, Yelp provides useful information on restaurants’ hours, food type, and price range. More importantly, it lets us instantly compare restaurants through user reviews that are conveniently summarized on a 5-star scale. In this project, we aim to use the skills we learned throughout the course to scrape, wrangle, explore, and model Yelp data on Boston-area restaurants, and to identify attributes that predict a restaurant’s rating.


Obtaining Data: Web Scraping and Wrangling

We first scraped the data from the Yelp website, using a Google Chrome extension called WebScraper to obtain our initial dataset.

In total, we scraped data for 1,000 restaurants in Boston, recording the search-results page on which each appeared.
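WebScraper is a point-and-click browser tool, but an equivalent scrape could also be sketched directly in R with the rvest package. The snippet below parses a toy HTML fragment standing in for one page of search results; the class names are invented for illustration, and Yelp’s real markup differs.

```r
library(rvest)  # read_html() is re-exported from xml2

## Toy HTML standing in for one page of search results
page <- read_html('
  <div class="result"><span class="name">Cafe A</span><span class="stars">4.5</span></div>
  <div class="result"><span class="name">Cafe B</span><span class="stars">3.0</span></div>')

## Select each result block, then pull the name and star fields out of it
results <- html_nodes(page, ".result")
scraped <- data.frame(
  name  = html_text(html_nodes(results, ".name")),
  stars = as.numeric(html_text(html_nodes(results, ".stars"))),
  stringsAsFactors = FALSE)
```

Scaling this to real pages would mean looping over search-result URLs, which is what the WebScraper extension automated for us.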

library(stringr) ## for str_split_fixed
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr) ## for separate function
library(knitr)

Here, we import the data obtained from WebScraper.

## Loading data from WebScraper
data <- read.csv("Yelp_data.csv", comment.char="#")

## Remove website address from the output dataset
data <- data[,-c(2,4)] 

However, the data obtained by WebScraper were quite messy and required extensive wrangling. For example, no overall rating scores were reported, and each attribute name and its value were scraped into the same cell. So further data cleaning was required.

## Data scraped directly from Yelp are not clean enough for analysis, so we will do some wrangling.
## Here we can see that an attribute name and its value share a single cell.
data[1,30] 
## [1] Ambience Casual
## 42 Levels: Alcohol Beer & Wine Only Alcohol Full Bar ... Wi-Fi Paid
## We need to separate the value from the attribute name.

## Separate each attribute cell on " " (space); each cell holds an attribute
## name followed by its value (e.g. "Ambience Casual")
for (i in 5:23) {
  data <- data %>%
    separate(col = paste0("column", i),
             into = c("attribute#1", "attribute#2", "attribute#3", "attribute#4"),
             sep = " ", fill = "right")
}

After separating the columns, we will use pattern matching to extract each attribute’s value, starting with the Wi-Fi attribute as an example.

## Locate the word "Wi-Fi" among the attribute columns (21 to 96) and take the
## cell to its right as the Wi-Fi value.
data$wifi <- "No"
for (i in 1:1000) {
  for (j in 21:96) {
    if (data[i,j] %in% "Wi-Fi") data$wifi[i] <- data[i,j+1]
  }
}
table(data$wifi)
## 
## Free   No Paid 
##  339  655    6
## There are three possible values: "Free", "No", "Paid". We assign 1 to restaurants with free or paid wifi and 0 to those without wifi.
data$wifi_rate <- ifelse(data$wifi %in% c("Free","Paid"),1,0)
table(data$wifi_rate)
## 
##   0   1 
## 655 345

Then we will apply the same strategy to the rest of the attributes.

## TV
for (i in 1:1000) {
  for (j in 21:96) {
    if (data[i,j] %in% "TV") data$TV[i] <- data[i,j+1]
  }
}
table(data$TV)
## 
##  No Yes 
## 505 495
data$TV_rate <- ifelse(data$TV=="Yes",1,0)
table(data$TV_rate)
## 
##   0   1 
## 505 495
## Noise level (we match the word "Level" from "Noise Level")
for (i in 1:1000) {
  for (j in 21:96) {
    if (data[i,j] %in% "Level") data$noise[i] <- data[i,j+1]
  }
}
table(data$noise)
## 
## Average    Loud   Quiet    Very 
##     799      91      91      19
data$noise_rate <- ifelse(data$noise %in% c("Loud","Very"),1,0)
table(data$noise_rate)
## 
##   0   1 
## 890 110
## Alcohol
for (i in 1:1000) {
  for (j in 21:96) {
    if (data[i,j] %in% "Alcohol") data$alcohol[i] <- data[i,j+1]
  }
}
table(data$alcohol)
## 
## Beer Full   No 
##  191  365  444
data$alcohol_rate <- ifelse(data$alcohol %in% c("Beer","Full"),1,0)
table(data$alcohol_rate)
## 
##   0   1 
## 444 556
## Seating
for (i in 1:1000) {
  for (j in 21:96) {
    if (data[i,j] %in% "Seating") data$seating[i] <- data[i,j+1]
  }
}
table(data$seating)
## 
##  No Yes 
## 687 313
data$seating_rate <- ifelse(data$seating %in% "Yes",1,0)
table(data$seating_rate)
## 
##   0   1 
## 687 313
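The keyword-search loops above could also be wrapped in a single helper to avoid repetition. Here is a base-R sketch (extract_attr and the toy data are our own, not part of the scraped dataset; in the real data the attribute columns are 21:96):

```r
## Scan the given columns of each row for a keyword and return the cell
## immediately to its right; rows without the keyword get `default`.
extract_attr <- function(df, keyword, cols, default = NA) {
  result <- rep(default, nrow(df))
  for (i in seq_len(nrow(df))) {
    hit <- which(unlist(df[i, cols]) == keyword)   # position of the keyword, if any
    if (length(hit) > 0) result[i] <- as.character(df[i, cols[hit[1]] + 1])
  }
  result
}

## Toy data: attribute/value pairs spread across columns, as in the scraped set
toy <- data.frame(a1 = c("Wi-Fi", "TV"),  a2 = c("Free", "Yes"),
                  a3 = c("TV", "Wi-Fi"),  a4 = c("No", "Paid"),
                  stringsAsFactors = FALSE)
extract_attr(toy, "Wi-Fi", 1:3)  # returns c("Free", "Paid")
```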

For the reservation, delivery, takeout, and credit card attributes, which were scraped into their own columns, we create binary indicators directly:

data$reservation_rate <- ifelse(data$reservation=="Yes",1,0)
data$delivery_rate <- ifelse(data$delivery=="Yes",1,0)
data$takeout_rate <- ifelse(data$takeout=="Yes",1,0)
data$creditcard_rate <- ifelse(data$creditcard=="Yes",1,0)

After cleaning the attribute information, we also need to calculate the rating value.

Using the counts of users who gave 1 to 5 stars to each restaurant, we can compute a continuous rating as a weighted average of the star counts. The final 5-star-scale Yelp rating is simply this continuous rating rounded to the nearest 0.5.

We will name this continuous variable “rate”.

We also calculate the number of raters, which is simply the sum of raters across all star categories.

data$rate <- (data$X5star*5 + data$X4star*4 + data$X3star*3 + data$X2star*2 + data$X1star*1)/(data$X5star + data$X4star + data$X3star + data$X2star + data$X1star)
data$numberrate <- data$X5star + data$X4star + data$X3star + data$X2star + data$X1star

We will then select the desired columns for further analysis.

# We will only keep the data fields we want
names(data)
##   [1] "page"             "name"             "category"        
##   [4] "address"          "X5star"           "X4star"          
##   [7] "X3star"           "X2star"           "X1star"          
##  [10] "monday"           "tuesday"          "wednesday"       
##  [13] "thursday"         "friday"           "saturday"        
##  [16] "sunday"           "reservation"      "delivery"        
##  [19] "takeout"          "creditcard"       "attribute#1"     
##  [22] "attribute#2"      "attribute#3"      "attribute#4"     
##  [25] "attribute#1.1"    "attribute#2.1"    "attribute#3.1"   
##  [28] "attribute#4.1"    "attribute#1.2"    "attribute#2.2"   
##  [31] "attribute#3.2"    "attribute#4.2"    "attribute#1.3"   
##  [34] "attribute#2.3"    "attribute#3.3"    "attribute#4.3"   
##  [37] "attribute#1.4"    "attribute#2.4"    "attribute#3.4"   
##  [40] "attribute#4.4"    "attribute#1.5"    "attribute#2.5"   
##  [43] "attribute#3.5"    "attribute#4.5"    "attribute#1.6"   
##  [46] "attribute#2.6"    "attribute#3.6"    "attribute#4.6"   
##  [49] "attribute#1.7"    "attribute#2.7"    "attribute#3.7"   
##  [52] "attribute#4.7"    "attribute#1.8"    "attribute#2.8"   
##  [55] "attribute#3.8"    "attribute#4.8"    "attribute#1.9"   
##  [58] "attribute#2.9"    "attribute#3.9"    "attribute#4.9"   
##  [61] "attribute#1.10"   "attribute#2.10"   "attribute#3.10"  
##  [64] "attribute#4.10"   "attribute#1.11"   "attribute#2.11"  
##  [67] "attribute#3.11"   "attribute#4.11"   "attribute#1.12"  
##  [70] "attribute#2.12"   "attribute#3.12"   "attribute#4.12"  
##  [73] "attribute#1.13"   "attribute#2.13"   "attribute#3.13"  
##  [76] "attribute#4.13"   "attribute#1.14"   "attribute#2.14"  
##  [79] "attribute#3.14"   "attribute#4.14"   "attribute#1.15"  
##  [82] "attribute#2.15"   "attribute#3.15"   "attribute#4.15"  
##  [85] "attribute#1.16"   "attribute#2.16"   "attribute#3.16"  
##  [88] "attribute#4.16"   "attribute#1"      "attribute#2"     
##  [91] "attribute#3"      "attribute#4"      "attribute#1"     
##  [94] "attribute#2"      "attribute#3"      "attribute#4"     
##  [97] "wifi"             "wifi_rate"        "TV"              
## [100] "TV_rate"          "noise"            "noise_rate"      
## [103] "alcohol"          "alcohol_rate"     "seating"         
## [106] "seating_rate"     "reservation_rate" "delivery_rate"   
## [109] "takeout_rate"     "creditcard_rate"  "rate"            
## [112] "numberrate"
data2 <- data[,c(1:16,98,100,102,104,106,109:112,107,108)]

# To check all the data are here
names(data2)
##  [1] "page"             "name"             "category"        
##  [4] "address"          "X5star"           "X4star"          
##  [7] "X3star"           "X2star"           "X1star"          
## [10] "monday"           "tuesday"          "wednesday"       
## [13] "thursday"         "friday"           "saturday"        
## [16] "sunday"           "wifi_rate"        "TV_rate"         
## [19] "noise_rate"       "alcohol_rate"     "seating_rate"    
## [22] "takeout_rate"     "creditcard_rate"  "rate"            
## [25] "numberrate"       "reservation_rate" "delivery_rate"

Lastly, we will calculate the actual Yelp star rating by rounding the rating column to the nearest 0.5. We inspected a few restaurants, and our calculations match the ratings shown on Yelp’s website.

data3 <- data2 %>%
  mutate(yelp_rating=(round(rate/0.5)*0.5))
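As a quick sanity check of the weighted average and the rounding rule, consider a hypothetical restaurant (star counts invented):

```r
## Hypothetical counts: 10 five-star, 5 four-star, 3 three-star, 1 two-star, 1 one-star
stars <- c(10, 5, 3, 1, 1)
rate  <- sum(stars * c(5, 4, 3, 2, 1)) / sum(stars)  # continuous rating: 82/20 = 4.1
yelp_rating <- round(rate / 0.5) * 0.5               # rounded to the nearest 0.5: 4.0
```

One caveat: R’s round() rounds halves to the nearest even number, which matters only when the continuous rating falls exactly halfway between two 0.5 steps (e.g. 4.25).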

Here we will save our data as a .csv file to preserve our progress. (This code is displayed but not run when knitting.)

write.csv(data2, "Cleandata.csv", row.names=FALSE)


Exploratory Data Analysis (EDA)

With the wrangled dataset, we now want to explore the data through EDA techniques.

# First, we will look at the distribution of star ratings for the restaurants we got
summary(data3$yelp_rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.500   4.000   3.742   4.000   5.000
count(data3, yelp_rating)
## Source: local data frame [7 x 2]
## 
##   yelp_rating     n
##         (dbl) (int)
## 1         2.0     1
## 2         2.5    17
## 3         3.0   138
## 4         3.5   338
## 5         4.0   368
## 6         4.5   123
## 7         5.0    15

It looks like the majority of our ratings fall between 3.5 and 4.0. We have 15 restaurants with a 5-star rating and one with a 2-star rating.

We then want to summarize the attributes for the different rating categories

table <- data3 %>% group_by(yelp_rating) %>% summarise(page=mean(page),raters=mean(numberrate),takeout=mean(takeout_rate),alcohol=mean(alcohol_rate),noise=mean(noise_rate)) %>% ungroup()
kable(round(table, digits=2))             
yelp_rating    page   raters   takeout   alcohol   noise
        2.0   96.00    32.00      1.00      0.00    1.00
        2.5   84.94   163.06      0.94      0.76    0.24
        3.0   73.02   197.41      0.92      0.67    0.14
        3.5   59.65   200.70      0.91      0.67    0.12
        4.0   37.41   249.30      0.84      0.51    0.10
        4.5   32.36   169.30      0.87      0.28    0.05
        5.0   64.80    14.67      0.87      0.20    0.00

Here are the descriptions of the table columns:

  • yelp_rating: the Yelp star rating (continuous rating rounded to the nearest 0.5)
  • page: the mean search-results page number on which the restaurants appeared
  • raters: the mean number of user ratings
  • takeout / alcohol / noise: the proportion of restaurants offering takeout, serving alcohol, or reported as noisy

We will now graph the data to see the trends

par(mfrow=c(2,3))

plot(table$yelp_rating,table$alcohol, main = "Alcohol", xlab = "Yelp Rating", ylab = "Percentage of Alcohol",pch=16,col="red",cex=1.5)
abline(lm(table$alcohol ~ table$yelp_rating))

plot(table$yelp_rating,table$page, main = "Page", xlab = "Yelp Rating", ylab = "Mean Page Number",pch=16,col="red",cex=1.5)
abline(lm(table$page ~ table$yelp_rating))

plot(table$yelp_rating,table$raters, main = "Raters", xlab = "Yelp Rating", ylab = "Mean number of raters",pch=16,col="red",cex=1.5)
abline(lm(table$raters ~ table$yelp_rating))

plot(table$yelp_rating,table$takeout, main = "Takeout", xlab = "Yelp Rating", ylab = "Percentage of Takeout Service",pch=16,col="red",cex=1.5)
abline(lm(table$takeout ~ table$yelp_rating))

plot(table$yelp_rating,table$noise, main = "Noise", xlab = "Yelp Rating", ylab = "Percentage of Reported Noise",pch=16,col="red",cex=1.5)
abline(lm(table$noise ~ table$yelp_rating))

Here is what we have noticed:

  • Higher-rated restaurants tend to appear on earlier search-results pages (the mean page number falls as the rating rises, except in the small 5.0-star category).
  • The proportions of restaurants serving alcohol and reported as noisy decrease as ratings increase.
  • The number of raters peaks for mid-range ratings; the few 5.0-star restaurants average only about 15 raters.

Note that the 2.0- and 5.0-star categories contain very few restaurants (1 and 15), so their averages rest on small samples.


Geographic Distribution of Yelp Ratings

We are interested in visualizing how Yelp Rating is distributed geographically in the city of Boston.

We will first obtain the geographic coordinates of each restaurant using the ggmap package.

# Convert address into coordinates using ggmap package
library(ggplot2)
library(ggmap)

# Use geocode() to obtain lat and lon
get_lat <- function(address){
  coordinates <- geocode(address, source="google", messaging=FALSE)
  return(coordinates$lat)
}

get_lon <- function(address){
  coordinates <- geocode(address, source="google", messaging=FALSE)
  return(coordinates$lon)
}
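One practical note: because get_lat() and get_lon() each call geocode(), every address is geocoded twice. A single helper that returns both coordinates would halve the API usage; this sketch (get_coords is our own name) is not run here, since geocoding requires network access and an API quota:

```r
## Geocode an address once and return both coordinates together
get_coords <- function(address){
  coordinates <- geocode(address, source="google", messaging=FALSE)
  data.frame(lat=coordinates$lat, lon=coordinates$lon)
}
```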

Since the geocode() function has a 2,500-query limit and obtaining all the coordinates takes a while, we show the code here and load the data from an existing .csv file when knitting.

## Add coordinate data to the dataframe (not run)
data3 <- data3 %>% mutate(lat=get_lat(as.character(address)), lon=get_lon(as.character(address)))

## Save as a .csv file (not run)
write.csv(data3, "Cleandata_coor.csv", row.names=FALSE)

dat <- read.csv("Cleandata_coor.csv")
dat2 <- tbl_df(dat)

We will now plot the data on a map of Boston.

# Map restaurants in the dataset, with point colors mapped to Yelp rating
map <- get_map(location='boston', source="stamen", maptype="terrain", zoom=13)
ggmap(map) + geom_point(aes(x=lon, y=lat, color=yelp_rating), data=dat2)

# Now we will map restaurants rated >=4.5 stars in red and the rest in blue
map <- get_map(location='boston', source="stamen", maptype="terrain", zoom=13)
ggmap(map) + geom_point(aes(x=lon, y=lat), data=filter(dat2, yelp_rating>=4.5), color="red") + geom_point(aes(x=lon, y=lat), data=filter(dat2, yelp_rating<4.5), color="blue")

# We will zoom in a bit to get a better visual of downtown Boston
map <- get_map(location='boston', source="stamen", maptype="terrain", zoom=14)
ggmap(map) + geom_point(aes(x=lon, y=lat), data=filter(dat2, yelp_rating>=4.5), color="red") + geom_point(aes(x=lon, y=lat), data=filter(dat2, yelp_rating<4.5), color="blue")

There does not appear to be any obvious geographic clustering of highly rated restaurants (>=4.5 stars). However, East Boston seems to have a greater number of highly rated restaurants than other areas.


Building Regression Models of Yelp Rating Data

Linear regression model

First, we will create a linear regression model to predict the continuous numerical rating before it is converted to the Yelp 5-star scale.

The predictor variables included in the model are:

  • Whether the restaurants take reservation
  • The page number it appears in
  • The total number of user reviews
  • Whether it delivers
  • Whether it provides takeout service
  • Whether it takes credit card
  • Whether there is WiFi
  • Whether there is TV
  • Whether there is outside seating
  • Whether it serves alcohol
  • Whether it is reported as noisy

fit <- lm(rate ~ reservation_rate + page + numberrate + delivery_rate + takeout_rate + creditcard_rate + wifi_rate + TV_rate + seating_rate + alcohol_rate + noise_rate, data=data3)
summary(fit)
## 
## Call:
## lm(formula = rate ~ reservation_rate + page + numberrate + delivery_rate + 
##     takeout_rate + creditcard_rate + wifi_rate + TV_rate + seating_rate + 
##     alcohol_rate + noise_rate, data = data3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21380 -0.21521 -0.01799  0.19853  1.60216 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.831e+00  5.895e-02  81.958  < 2e-16 ***
## reservation_rate -4.270e-02  3.018e-02  -1.415  0.15732    
## page             -8.907e-03  4.287e-04 -20.779  < 2e-16 ***
## numberrate       -2.014e-04  5.009e-05  -4.020 6.26e-05 ***
## delivery_rate    -2.812e-02  2.637e-02  -1.067  0.28642    
## takeout_rate     -2.258e-01  3.759e-02  -6.008 2.63e-09 ***
## creditcard_rate  -1.932e-01  4.293e-02  -4.501 7.56e-06 ***
## wifi_rate        -4.151e-02  2.492e-02  -1.666  0.09610 .  
## TV_rate          -6.826e-02  2.528e-02  -2.700  0.00705 ** 
## seating_rate      1.725e-02  2.489e-02   0.693  0.48846    
## alcohol_rate     -2.486e-01  3.366e-02  -7.387 3.20e-13 ***
## noise_rate       -1.167e-01  3.682e-02  -3.169  0.00157 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3577 on 988 degrees of freedom
## Multiple R-squared:  0.4028, Adjusted R-squared:  0.3961 
## F-statistic: 60.57 on 11 and 988 DF,  p-value: < 2.2e-16

It turns out that our model had an R-squared of 0.403, and the following factors were significant predictors of a decrease in the continuous rating:

  • The page number it appears in
  • The total number of user reviews
  • Whether it provides takeout service
  • Whether it takes credit card
  • Whether there is TV
  • Whether it serves alcohol
  • Whether it is reported as noisy

To see how well our model fits, we will apply it to our training (Boston) dataset and compare the predicted ratings with the observations.

par(mfrow=c(1,1))
data3$predict <-predict(fit)
plot(data3$rate,data3$predict, main = "Boston Prediction vs. Observation (Linear Model)", xlab = "Rate_Observation", ylab = "Rate_Prediction")

cor(data3$rate,data3$predict)
## [1] 0.6346307

Our predicted values had a correlation of 0.635 with the observed values, which is not bad :).

Logistic regression model

We wanted to see whether different methods of modeling could improve our ability to predict Yelp ratings based on the attributes we collected.

In this case the outcome is whether the restaurant achieved a 4.5 or 5.0 star rating. The predictor variables are the same as our linear model.

# Next we will create a logistic regression model
data3$log_rate <- ifelse(data3$yelp_rating < 4.5,0,1)
logit <- glm(log_rate ~ page + numberrate + reservation_rate + delivery_rate + takeout_rate + creditcard_rate + wifi_rate + TV_rate + seating_rate + alcohol_rate + noise_rate, data=data3, family = "binomial")
summary(logit)
## 
## Call:
## glm(formula = log_rate ~ page + numberrate + reservation_rate + 
##     delivery_rate + takeout_rate + creditcard_rate + wifi_rate + 
##     TV_rate + seating_rate + alcohol_rate + noise_rate, family = "binomial", 
##     data = data3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4010  -0.5137  -0.3469  -0.1875   3.2862  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       2.1836300  0.4957771   4.404 1.06e-05 ***
## page             -0.0347856  0.0042650  -8.156 3.46e-16 ***
## numberrate       -0.0019250  0.0006471  -2.975  0.00293 ** 
## reservation_rate -0.2597043  0.3000577  -0.866  0.38676    
## delivery_rate    -0.1577415  0.2327803  -0.678  0.49800    
## takeout_rate     -1.0564195  0.3424082  -3.085  0.00203 ** 
## creditcard_rate  -0.2132910  0.3148222  -0.677  0.49809    
## wifi_rate        -0.4254220  0.2421754  -1.757  0.07897 .  
## TV_rate          -0.0009027  0.2227967  -0.004  0.99677    
## seating_rate     -0.0351121  0.2273668  -0.154  0.87727    
## alcohol_rate     -1.5575309  0.3197722  -4.871 1.11e-06 ***
## noise_rate       -0.9247995  0.4597414  -2.012  0.04427 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 802.63  on 999  degrees of freedom
## Residual deviance: 649.24  on 988  degrees of freedom
## AIC: 673.24
## 
## Number of Fisher Scoring iterations: 6

To check the fit of our logistic regression model, we ran a Hosmer and Lemeshow goodness-of-fit test.

# Verify Logistic regression model
library(ResourceSelection)
## ResourceSelection 0.2-6   2016-02-15
hoslem.test(data3$log_rate, fitted(logit))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  data3$log_rate, fitted(logit)
## X-squared = 8.3447, df = 8, p-value = 0.4005

Our goodness-of-fit test had a p-value > 0.05, so we find no evidence that the model’s predicted probabilities deviate from the observed outcomes; the model appears to fit adequately.

For the logistic regression model, we observed that the following variables were significant predictors of a >=4.5 star rating:

  • The page number it appears in
  • The total number of user reviews
  • Whether it provides takeout service
  • Whether it serves alcohol
  • Whether it is reported as noisy

To see how well our model fits the training (Boston) data, we check the accuracy of its predictions.

# Testing the logistic regression model on the Boston data.
# predict.glm() returns log-odds by default, so we request probabilities with type="response"
data3$predictlog <- predict.glm(logit, type="response")
data3$log_rate_prediction <- ifelse(data3$predictlog < 0.5,0,1)
data3$log_test <- ifelse(data3$log_rate_prediction == data3$log_rate,"Correct","Wrong")
table(data3$log_test)
## 
## Correct   Wrong 
##     866     134

Using our original training dataset, our model predicted 86.6% of the binary ratings correctly (either < or >= 4.5 stars).
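A note on prediction scales that the cutoff relies on: predict.glm() returns predictions on the log-odds (link) scale by default, while type="response" returns probabilities, and a 0.5 cutoff is only meaningful on the probability scale. A small self-contained illustration on simulated data:

```r
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))         # simulate a binary outcome from log-odds x
m <- glm(y ~ x, family = "binomial")

link <- predict(m)                      # log-odds (default) scale
prob <- predict(m, type = "response")   # probability scale, values in [0, 1]
## plogis() maps log-odds to probabilities, so the two agree:
all.equal(plogis(link), prob)
```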


Testing the Model with a New Dataset

Now that we have built models using linear and logistic regression, we will test them on a new dataset.

We scraped another 201 restaurants from Cambridge with which to test our models. The data were wrangled in an identical fashion to the Boston restaurants (not shown), and we added coordinate data using the functions above.

## Location data code (not run)
data_cam <- data_cam %>% mutate(lat=get_lat(as.character(address)), lon=get_lon(as.character(address)))
write.csv(data_cam, "Cleandata_cambridge_coor.csv", row.names=FALSE)

# Here we will obtain the wrangled data from a .csv file
data_cam <- read.csv("Cleandata_cambridge_coor.csv", comment.char="#")

We will do a quick exploration of the new test dataset.

# Summarize the data
summary(data_cam$yelp_rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   3.500   4.000   3.841   4.000   5.000
count(data_cam, yelp_rating)
## Source: local data frame [5 x 2]
## 
##   yelp_rating     n
##         (dbl) (int)
## 1         3.0    14
## 2         3.5    62
## 3         4.0   103
## 4         4.5    18
## 5         5.0     4
# Examine attributes by Yelp rating
table2 <- data_cam %>% group_by(yelp_rating) %>% summarise(page=mean(page),raters=mean(numberrate),takeout=mean(takeout_rate),alcohol=mean(alcohol_rate),noise=mean(noise_rate)) %>% ungroup()
kable(round(table2, digits=2)) 
yelp_rating    page   raters   takeout   alcohol   noise
        3.0   19.86   317.79      0.86      0.86    0.29
        3.5   15.74   321.16      0.92      0.71    0.13
        4.0    9.27   269.21      0.87      0.52    0.05
        4.5    7.89   117.56      0.72      0.39    0.06
        5.0   11.50    22.00      0.50      0.25    0.00

The trends in our new Cambridge dataset are the same as those we observed in the Boston dataset.

Testing linear regression model

We will test our linear model against the new Cambridge dataset.

data_cam$predict <-predict(fit,newdata =data_cam )
plot(data_cam$rate,data_cam$predict, main = "Cambridge Prediction vs. Observation (Linear Model)", xlab = "Rate_Observation", ylab = "Rate_Prediction")

cor(data_cam$rate,data_cam$predict)
## [1] 0.5026383

Our predicted continuous rating data has a correlation of 0.503 with the observations.

Testing logistic regression model

Let’s now test our logistic regression model against the new Cambridge dataset.

# Generating predicted probabilities (type="response" returns probabilities rather than log-odds)
data_cam$predictlog <- predict.glm(logit, newdata=data_cam, type="response")

# We will use a cutoff of 0.5: if the predicted probability is < 0.5, we designate the predicted outcome as 0; otherwise we designate it as 1. 
data_cam$log_rate_prediction <- ifelse(data_cam$predictlog < 0.5,0,1)

# Now let's check to see if our predictions were correct
data_cam$log_test <- ifelse(data_cam$log_rate_prediction == data_cam$log_rate, "Correct", "Wrong")
table(data_cam$log_test)
## 
## Correct   Wrong 
##     185      16

We had an accuracy of 92.0% with our logistic regression model!

Visualizing our predictions on a map

To see the power of our models in predicting Yelp ratings, we will plot the results on a map.

Let’s first map all the restaurants in Cambridge, with point colors mapped to rating.

map <- get_map(location='cambridge', source="stamen", maptype="terrain", zoom=13)
ggmap(map) + geom_point(aes(x=lon, y=lat, color=yelp_rating), data=data_cam)

Next, we will map the restaurants by whether our model correctly predicted a rating of >=4.5 stars.

map <- get_map(location='cambridge', source="stamen", maptype="terrain", zoom=13)
ggmap(map) + geom_point(aes(x=lon, y=lat), data=filter(data_cam, log_test=="Wrong"), color="red") + geom_point(aes(x=lon, y=lat), data=filter(data_cam, log_test=="Correct"), color="green")

Looks good!

Conclusion

In conclusion, we started this project wanting to understand the contributors to Yelp user ratings and to find ways to predict ratings through modeling.

We scraped and wrangled data for a total of 1,201 restaurants in Boston and Cambridge and explored the data. We also visualized how the restaurant ratings were distributed geographically in Boston.

Next, we constructed a linear model to predict the continuous rating score from Yelp and obtained a correlation of 0.635 with the training (Boston) data and 0.503 with the test (Cambridge) data. Our explanatory variables account for 40.3% of the variation in the data.

Finally, we constructed a logistic regression model to predict whether a restaurant’s Yelp rating (rounded to the nearest 0.5) reached 4.5 stars, and we obtained 86.6% accuracy on the training data and 92.0% accuracy on the test data.

Discussion

Through our logistic model, a new restaurant can predict whether it will be rated 4.5 stars or above with roughly 90% accuracy simply from its attribute data. This offers restaurants a way to examine how they might improve their user ratings by changing attributes, such as whether to serve alcohol or to allow a noisy ambience.

One of the key limitations we saw was the role of page number and total number of raters. Page number was a significant predictor of Yelp rating, which we hypothesize is because Yelp’s search algorithm displays higher-rated restaurants earlier. It would therefore be a major confounder in our explanatory model.

We also observed that restaurants with a higher number of raters tend to have mid-range ratings, which suggests that larger samples of reviewers produce results closer to the average.

Lastly, we were not able to obtain price data, and we did not use the opening-hours data in our analysis. We believe incorporating these data could improve the explanatory power of our model.

Acknowledgement

Thank you very much for a great course. We learned a lot, and we appreciate the hard work the course staff have devoted to educating us in the magic of data science.

Zhi Dong and Shihao Zhu